# CSCI 210: Computer Architecture Lecture 20: Performance

Stephen Checkoway
Slides from Cynthia Taylor

# Circuitverse Tips

- Add more inputs to the gates if you're combining lots of inputs
- Circuitverse will reorder inputs and outputs for subcircuits
- Label your inputs and outputs so you can tell what they are



# Circuitverse/Lab Questions?

# **CS History: Performance**

| Machine                                  | Millions of<br>instructions<br>per second (MIPS) | Instructions<br>per cycle | Instructions<br>per cycle per<br>core | Year |
|------------------------------------------|--------------------------------------------------|---------------------------|---------------------------------------|------|
| UNIVAC I                                 | 0.002                                            | 0.0008                    | 0.0008                                | 1951 |
| IBM 7030 ("Stretch")                     | 1.200                                            | 0.364                     | 0.364                                 | 1961 |
| VAX-11/780                               | 1.000                                            | 0.2                       | 0.2                                   | 1977 |
| Intel i860                               | 25                                               | 1                         | 1                                     | 1989 |
| Intel Core i7 3770K (4-core)             | 106,924                                          | 27.4                      | 6.9                                   | 2012 |
| Raspberry Pi 2 (quad-core ARM Cortex A7) | 4,744                                            | 4.744                     | 1.186                                 | 2014 |
| Intel Core i5-11600K (6-core)            | 346,350                                          | 57.72                     | 11.73                                 | 2021 |

Note: Millions of instructions per second is a misleading metric as it does not tell us what the instructions do

#### Measures of "Performance"

- Execution Time
- Frame Rate
- Throughput (operations/time)
- Responsiveness
- Performance / Cost
- Performance / Power

#### Match (Best) Performance Metric to Domain

#### **Performance Metrics**

1. Network Bandwidth (data/sec)
2. Network Latency (ms per roundtrip)
3. Frame Rate (frames/sec)
4. Throughput (operations/sec)

| Selection | Online Games      | High-def video | Torrent Download | Server Cluster |
|-----------|-------------------|----------------|------------------|----------------|
| A         | 4                 | 3              | 1                | 2              |
| B         | 4                 | 1              | 3                | 2              |
| C         | 2                 | 1              | 3                | 4              |
| D         | 2                 | 3              | 1                | 4              |
| E         | None of the above |                |                  |                |

# Metrics for running a program

Execution Time – how long does it take to run?

CPI – (clock) cycles per instruction

Instruction Count – how many instructions does it have?

Clock cycle time – how long does one clock cycle take? (the inverse of the clock rate)

# A note on cycles per instruction

- Different instructions can take different lengths of time.
  - Multiplication and division take longer than other arithmetic and logical operations
  - Floating point takes longer than integer operations
  - Memory instructions take longer than everything else
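Because instruction classes cost different numbers of cycles, a program's overall CPI is a weighted average over the instructions it actually executes. The sketch below illustrates this; the instruction mix and per-class cycle counts are made-up values chosen only to follow the ordering in the list above, not measurements from any real processor.

```python
# Hypothetical instruction mix (illustrative values only):
# class name -> (fraction of executed instructions, cycles per instruction)
instruction_mix = {
    "alu":     (0.50, 1),   # add, sub, and, or, ...
    "mul_div": (0.10, 4),
    "float":   (0.10, 5),
    "memory":  (0.30, 10),
}

def average_cpi(mix):
    """Weighted average CPI: sum of fraction * cycles over all classes."""
    return sum(fraction * cycles for fraction, cycles in mix.values())

print(f"Average CPI: {average_cpi(instruction_mix):.2f}")
```

With this mix, the memory instructions contribute most of the cycles even though they are less than a third of the instructions.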

# All Together Now

```
CPU Execution Time = Instruction Count X CPI X Clock Cycle Time
```
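As a minimal sketch, the equation translates directly into a one-line function. The function name, parameter names, and example numbers here are hypothetical and chosen only for illustration.

```python
def cpu_execution_time(instruction_count, cpi, clock_cycle_time_s):
    """CPU execution time in seconds = IC x CPI x clock cycle time."""
    return instruction_count * cpi * clock_cycle_time_s

# Hypothetical program: 2 million instructions, average CPI of 2,
# on a 1 GHz processor (clock cycle time = 1 / 10^9 s = 1 ns).
example_time = cpu_execution_time(2_000_000, 2, 1e-9)
print(f"{example_time * 1e3:.1f} ms")  # prints 4.0 ms
```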

# All Together Now



You have a 1 billion (10<sup>9</sup>) instruction program, a 500 MHz processor, and an execution time of 3 seconds. What is the CPI for this program?

Note that 1 MHz = 1 million (10<sup>6</sup>) cycles per second

| Selection | CPI                        |
|-----------|----------------------------|
| A         | 3                          |
| B         | 15                         |
| C         | 1.5                        |
| D         | 15 × 10<sup>9</sup>        |
| E         | None of the above          |
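One way to work this question, as a sketch: convert the clock rate to a clock cycle time (CT), then rearrange the performance equation to solve for CPI.

$$\text{CT} = \frac{1}{500 \times 10^{6}\ \text{cycles/s}} = 2 \times 10^{-9}\ \text{s}$$

$$\text{CPI} = \frac{\text{CPU Execution Time}}{\text{IC} \times \text{CT}} = \frac{3\ \text{s}}{10^{9} \times 2 \times 10^{-9}\ \text{s}} = 1.5$$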



$$\text{CPU Execution Time} = \underbrace{\text{Instruction Count}}_{\text{IC}} \times \text{CPI} \times \underbrace{\text{Clock Cycle Time}}_{\text{CT}}$$

Several people are involved in processor and program design.

Each element of the performance equation can be influenced by different designers.

The next slides cover who can influence what.

```
CPU Execution Time = Instruction Count X CPI X Clock Cycle Time
```

What can a programmer influence?

| Selection | Impacts           |
|-----------|-------------------|
| A         | IC                |
| B         | IC, CPI           |
| C         | IC, CPI, and CT   |
| D         | IC and CT         |
| E         | None of the above |

What can a compiler influence?

| Selection | Impacts           |
|-----------|-------------------|
| A         | IC                |
| B         | IC, CPI           |
| C         | IC, CPI, and CT   |
| D         | CPI and CT        |
| E         | None of the above |

```
CPU Execution Time = Instruction Count X CPI X Clock Cycle Time
```

What can an instruction set architect influence?

| Selection | Impacts           |
|-----------|-------------------|
| A         | IC                |
| B         | IC, CPI           |
| C         | IC, CPI, and CT   |
| D         | CPI and CT        |
| E         | None of the above |

```
CPU Execution Time = Instruction Count X CPI X Clock Cycle Time
```

What can a hardware designer influence? Assume they are designing a chip for a fixed ISA.

| Selection | Impacts           |
|-----------|-------------------|
| A         | IC                |
| B         | IC, CPI           |
| C         | IC, CPI, and CT   |
| D         | CPI and CT        |
| E         | None of the above |

# If we run two different programs on the same machine, how do the number of instructions, CPI, and clock cycle time compare?

| Selection | Number of instructions | CPI       | Clock cycle time |
|-----------|------------------------|-----------|------------------|
| A         | Same                   | Same      | Same             |
| B         | Different              | Same      | Same             |
| C         | Different              | Different | Same             |
| D         | Different              | Different | Different        |
| E         | Different              | Same      | Different        |

If we run the same program on two different machines with different ISAs, how do the number of instructions, CPI, and clock cycle time compare?

| Selection | Number of instructions | CPI       | Clock cycle time |
|-----------|------------------------|-----------|------------------|
| A         | Same                   | Same      | Same             |
| B         | Same                   | Same      | Different        |
| C         | Same                   | Different | Different        |
| D         | Different              | Different | Different        |
| E         | Different              | Same      | Same             |

If we run the same program on two different machines with the same ISA, how do the number of instructions, CPI, and clock cycle time compare?

| Selection | Number of instructions | CPI       | Clock cycle time |
|-----------|------------------------|-----------|------------------|
| A         | Same                   | Same      | Same             |
| B         | Same                   | Same      | Different        |
| C         | Same                   | Different | Different        |
| D         | Different              | Different | Different        |
| E         | Different              | Same      | Same             |

# How we can measure CPU performance

Millions of instructions per second

Performance on benchmarks—programs designed to measure performance

Performance on real programs

# MIPS (not the name of the architecture)

MIPS = Millions of Instructions Per Second

$$\text{MIPS} = \frac{\text{Instruction Count}}{\text{Execution Time} \times 10^{6}}$$

- program-dependent
- deceptive
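To see why MIPS can be deceptive, consider the sketch below. It compares two hypothetical versions of the same program on the same machine; the instruction counts and execution times are invented for illustration.

```python
def mips(instruction_count, execution_time_s):
    """Millions of instructions per second."""
    return instruction_count / (execution_time_s * 1e6)

# Two hypothetical compilations of the same program on the same machine.
# Version A uses many simple instructions; version B uses fewer, slower ones.
a_count, a_time = 10e9, 5.0   # 10 billion instructions in 5 s
b_count, b_time = 4e9, 4.0    # 4 billion instructions in 4 s

print(f"A: {mips(a_count, a_time):.0f} MIPS, {a_time} s")
print(f"B: {mips(b_count, b_time):.0f} MIPS, {b_time} s")
```

Version A earns the higher MIPS rating, yet version B finishes sooner, so ranking by MIPS would pick the slower option.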

# Speedup

We often want to compare the performance of one machine against another.

$$\text{Performance} = \frac{1}{\text{Execution Time}}$$

$$\text{Speedup (A over B)} = \frac{\text{Performance}_{A}}{\text{Performance}_{B}}$$

$$\text{Speedup (A over B)} = \frac{\text{ET}_{B}}{\text{ET}_{A}}$$
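A minimal sketch of these definitions follows, with speedup computed as the ratio of A's performance to B's, which is the same as the ratio of execution times ET<sub>B</sub> / ET<sub>A</sub>. The function names and the example timings are hypothetical.

```python
def performance(execution_time_s):
    """Performance is the inverse of execution time."""
    return 1.0 / execution_time_s

def speedup(et_a, et_b):
    """Speedup of A over B: how many times faster A is than B."""
    return performance(et_a) / performance(et_b)

# Hypothetical timings: machine A runs the program in 10 s, machine B in 15 s.
print(f"Speedup (A over B): {speedup(10.0, 15.0):.2f}")
```

A speedup greater than 1 means A is faster; a value below 1 means A is slower than B.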

#### Amdahl's Law

$$\text{Execution time after improvement} = \frac{\text{Execution Time Affected}}{\text{Amount of Improvement}} + \text{Execution Time Unaffected}$$
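The formula can be sketched as a small function. The function name, parameter names, and numbers below are hypothetical: 80 seconds of a program are affected by an improvement that makes that part 2 times faster, and 20 seconds are unaffected.

```python
def time_after_improvement(time_affected_s, improvement, time_unaffected_s):
    """Amdahl's Law: only the affected portion is divided by the improvement."""
    return time_affected_s / improvement + time_unaffected_s

# Hypothetical: 80 s affected by a 2x improvement, 20 s unaffected.
new_time = time_after_improvement(80.0, 2.0, 20.0)
print(f"{new_time:.1f} s")  # prints 60.0 s
```

Even though part of the program became twice as fast, the whole program went from 100 s to 60 s, a speedup of less than 2.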

#### Amdahl's Law and Parallelism

Our program is 90% parallelizable (segment of code executable in parallel on multiple cores) and runs in 100 seconds with a single core. What is the execution time if you use 4 cores (assume no overhead for parallelization)?

$$\text{Execution time after improvement} = \frac{\text{Execution Time Affected}}{\text{Amount of Improvement}} + \text{Execution Time Unaffected}$$

| Selection | Execution Time    |
|-----------|-------------------|
| A         | 25 seconds        |
| B         | 32.5 seconds      |
| C         | 50 seconds        |
| D         | 92.5 seconds      |
| E         | None of the above |
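One way to work this question, as a sketch: the affected (parallelizable) portion is 90% of 100 seconds, or 90 seconds, the unaffected portion is the remaining 10 seconds, and the amount of improvement is taken to be the number of cores, since no overhead is assumed.

$$\text{Execution time after improvement} = \frac{90\ \text{s}}{4} + 10\ \text{s} = 22.5\ \text{s} + 10\ \text{s} = 32.5\ \text{s}$$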

# Amdahl's Law

So what does Amdahl's Law mean at a high level?

| Selection | "BEST" message from Amdahl's Law                                                                                |
|-----------|-----------------------------------------------------------------------------------------------------------------|
| A         | Parallel programming is critical for improving performance                                                      |
| B         | Improving serial code execution is ultimately the most important goal.                                          |
| C         | Performance is strictly tied to the ability to determine which percentage of code is parallelizable.            |
| D         | The impact of a performance improvement is limited by the percent of execution time affected by the improvement |
| E         | None of the above                                                                                               |
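To get a feel for the law, the sketch below reuses the earlier example (100 seconds on one core, 90% parallelizable, no parallelization overhead) and varies the core count. The function name and the chosen core counts are hypothetical.

```python
def parallel_time(total_time_s, parallel_fraction, cores):
    """Amdahl's Law, assuming no overhead for parallelization."""
    affected = total_time_s * parallel_fraction
    unaffected = total_time_s - affected
    return affected / cores + unaffected

for cores in (1, 2, 4, 16, 256, 4096):
    t = parallel_time(100.0, 0.9, cores)
    print(f"{cores:5d} cores: {t:7.2f} s, speedup {100.0 / t:5.2f}x")
```

However many cores are added, the execution time never drops below the 10 seconds of unaffected work, so the speedup stays under 10x.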

# **Key Points**

- Be careful how you specify "performance"
- Execution time = IC \* CPI \* CT
- Use real applications, if possible
- Make the common case fast